Skip to content

CoW: Push reference tracking down to the block level #51144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Feb 8, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 3, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This pr removes the reference tracking from the manager level and adds it to the block level. Tests are passing locally.
Right now we are tracking the references also in non-CoW mode, but we are only accessing the object in CoW mode, so should not be harmful. Can also disable this, but wanted to avoid accessing using_copy_on_write() constantly.

Two things that the previous mechanism couldn't handle:

df = DataFrame({"a": [1, 2, 3]})
df = df.reset_index()
df.iloc[0, 1]

This triggered a copy, because the reference to the original df was kept alive. Same example: We were not able to free the memory of old object, if the result of an operation was assigned to the same variable.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! And a nice simplification ;)

I will only have time on Monday (during the day) for a thorough review, just a few quick comments

One thing I am wondering: because it is now at the Block level, it might be more dtype / Block-subclass specific, and we should maybe parametrize our basic copy_view tests more with different categories of dtypes (eg at least numpy based on extension/nullable) (something we should have done anyway, can also be done in a follow-up/precursor, happy to take a look at that)

@@ -59,8 +60,13 @@ class SharedBlock:
_mgr_locs: BlockPlacement
ndim: int
values: ArrayLike
refs: BlockValuesRefs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent?

Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that's maybe a good question: is it equivalent?
With the Blockmanager tracking, the rule was that each DataFrame had its own unique manager, so one manager was never shared between different dataframes. I assume similarly here, we always create new Blocks for new Managers/DataFrames (potentially just shallow copies). So if a certain values is shared (referenced) by 3 DataFrames, there are 3 unique Block objects that hold the values? Or differently stated, the referenced_blocks list, it's always all weakrefs to unique blocks in there? Or can there ever be two weakrefs to the same Block instance? Now this is done at the Block level, that can actually be OK to have?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though, the way it is implemented tracking the new refs works automatically when you create a shallow copy, while you would have to explicitly update the references with your own block again. I think this would look a bit hacky

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name

Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though,

Yes, agreed that we certainly intent to always use shallow copies (new blocks)

My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming

Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)

# TODO(CoW): This is tricky, if parent block goes out of scope
# all split blocks are referencing each other even though they
# don't share data
refs = self.refs if self.refs.has_reference() else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create kind of a "clone" of the refs (duplicate it)? Create a new refs that creates new weakrefs to the same objects. But that ensures that both lists are independent, and the two split blocks don't reference each other

Copy link
Member Author

@phofl phofl Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doable (in theory), but not in this pr (would make things way more complicated). The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object
You could have:

a -> b -> c -> split happens ->d1, d2, d3, ...

When you clone the reference object, you would have to make sure that a, b and c are all aware of the new object. The simplest solution would probably be, that a reference object can point to another reference object which would live with the split blocks. These chains can get arbitrarily deep though, so we have to think the exact approach through.

Good thing with this class is, that we can hide the implementation inside it without having to worry about it in other areas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object

Ah, yes, that's an aspect that I didn't consider when writing the above comment, and that indeed makes it quite a bit more complicated ..


Shorter term, we could maybe implement a small optimization in case the original dataframe doesn't have any references, then each of the sliced blocks will still have no foreign references?

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [3]: del df['b']

In [4]: df._mgr
Out[4]: 
BlockManager
Items: Index(['a', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
NumericBlock: slice(0, 1, 1), 1 x 10, dtype: float64
NumericBlock: slice(1, 2, 1), 1 x 10, dtype: float64
NumericBlock: slice(2, 3, 1), 1 x 10, dtype: float64

In [9]: df._mgr.blocks[0].refs.referenced_blocks
Out[9]: 
[<weakref at 0x7fe48cc6ec20; dead>,
 <weakref at 0x7fe48caaa0e0; to 'NumericBlock' at 0x7fe48cb8d9a0>,
 <weakref at 0x7fe48cb82040; to 'NumericBlock' at 0x7fe48cb87160>,
 <weakref at 0x7fe48cb82680; to 'NumericBlock' at 0x7fe48cb6f100>]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer doing this in a follow up. We are calling _slice_take_blocks_ax0 under the hood which would have to avoid tracking references via a keyword. Not super difficult but would make the pr messier than necessary. But will get a pr up before 2.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, to be clear my "shorter term" was meant as "after this PR and if we don't directly have a better general solution for 2.0" ;)


if unfit_val_locs:
unfit_idxr = np.concatenate(unfit_mgr_locs)
unfit_count = len(unfit_idxr)

new_blocks: list[Block] = []
# TODO(CoW) is this always correct to assume that the new_blocks
# are not referencing anything else?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some tests in copy_view/test_setitem.py that checks we do copy the array-like values we are setting (not sure how complete they are though, given this is a beast of a method). And so because we copy upfront, that currently is fine regarding CoW (for setting full columns, which I think is what iset is used for?). Although a future optimization we can add is that we can actually delay the copy for Series/DataFrame being set as new column(s). And then we will have to passthrough its refs somehow to this method.

EDIT: Ah, I see it was my own comment that was moved ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep :) This is on my todo list as well, just moved it to a better place after the modification


df["b"] = 100
arr = get_array(df, "a")
df.iloc[0, 0] = 100 # Check that we correctly track reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is a bit confusing because it's testing we are not tracking a reference

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the comment

@jorisvandenbossche
Copy link
Member

Some additional test ideas:

def test_out_of_scope():
    def func():
        df = DataFrame(..)
        # create some subset
        result = df[[cols]]
        return result
    df = func()
    # assert that df has no refs
def test_delete():
    df = DataFrame(np.random.randn(4, 3), columns=['a', 'b', 'c'])
    del df['b']
    # at this point, 'a' and 'c' each should only have 1 reference (in addition to itself)
    df = df[['a']]
    # 'a' should should be owned, and 'c' should have gone out of references (not sure if there is an easy way to test that)

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Feb 3, 2023
@phofl
Copy link
Member Author

phofl commented Feb 3, 2023

Agree with parametrising the tests, this was one my todo list as well. Would prefer to do in a separate pr though.

@phofl
Copy link
Member Author

phofl commented Feb 3, 2023

Added the tests of your two cases.

The del right now sets up references so that we think that there is another reference alive, because we are creating shallow copies under the hood.

Is there a way that delete would actually require us to keep references alive? Otherwise we could fix this in idelete on the manager level

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my comments are relatively minor things, I think we can certainly try to merge this shortly so other CoW PRs don't need to be hold up.

Some comments I couldn't make inline:

  • Did you check all occurrences of Block.make_block() that the values are newly created and no refs need to be tracked?
    For example _maybe_downcast doesn't necessarily copy the data (but didn't check in detail where _maybe_downcast itself is being used; it might do that only in cases where we already have a copy). It's also used in astype, so this might something to handle in an updated version of the PR that handles astype. From a quick look the rest seems OK.
  • Block.convert() can currently return [self]. Should that be [self.copy(deep=False)]? Again it might depend on where this is actually called. For example it is called from infer_objects, but that's also something for which we have a follow-up PR. Other cases where this is called from within other Block methods might already have copied beforehand.
  • Can you add a comment to Block.iget() that it returns a view and that the caller needs to take care of keeping a reference if needed (I did something similar for BlockManager.iget_values, can copy from there)
  • Just a question: do we (eventually, later, not for this PR) move some of the CoW-triggers from the manager to block level as well?

Before merging: do we want to do the ref tracking only if CoW is enabled? I think that should be fine to keep as is (also before this PR, we were creating the refs list of weakrefs on BlockManager in all cases, while only using it when CoW is enabled (parent was only set if CoW was actually enabled, but that's because it was a hard reference. In this PR we only have weakrefs, so that should be fine)

And thanks for starting some developer-oriented documentation!

@@ -59,8 +60,13 @@ class SharedBlock:
_mgr_locs: BlockPlacement
ndim: int
values: ArrayLike
refs: BlockValuesRefs
Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though,

Yes, agreed that we certainly intent to always use shallow copies (new blocks)

My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming

Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)

# TODO(CoW): This is tricky, if parent block goes out of scope
# all split blocks are referencing each other even though they
# don't share data
refs = self.refs if self.refs.has_reference() else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object

Ah, yes, that's an aspect that I didn't consider when writing the above comment, and that indeed makes it quite a bit more complicated ..


Shorter term, we could maybe implement a small optimization in case the original dataframe doesn't have any references, then each of the sliced blocks will still have no foreign references?

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [3]: del df['b']

In [4]: df._mgr
Out[4]: 
BlockManager
Items: Index(['a', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
NumericBlock: slice(0, 1, 1), 1 x 10, dtype: float64
NumericBlock: slice(1, 2, 1), 1 x 10, dtype: float64
NumericBlock: slice(2, 3, 1), 1 x 10, dtype: float64

In [9]: df._mgr.blocks[0].refs.referenced_blocks
Out[9]: 
[<weakref at 0x7fe48cc6ec20; dead>,
 <weakref at 0x7fe48caaa0e0; to 'NumericBlock' at 0x7fe48cb8d9a0>,
 <weakref at 0x7fe48cb82040; to 'NumericBlock' at 0x7fe48cb87160>,
 <weakref at 0x7fe48cb82680; to 'NumericBlock' at 0x7fe48cb6f100>]

@@ -663,7 +626,7 @@ def consolidate(self: T) -> T:
if self.is_consolidated():
return self

bm = type(self)(self.blocks, self.axes, self.refs, verify_integrity=False)
bm = type(self)(self.blocks, self.axes, verify_integrity=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass shallow copies of the blocks in self.blocks here to keep track of references?

_consolidate_inplace will not update all blocks, so this might create a new BlockManager with blocks that shares data with the original BlockManager.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a call where this would matter right now. It's either coming from copy or from a function where we did a copy of the blocks anyway. At least I couldn't find any. Could add it though if you prefer.

def test_assigning_to_same_variable_removes_references(using_copy_on_write):
df = DataFrame({"a": [1, 2, 3]})
df = df.reset_index()
arr = get_array(df, "a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arr = get_array(df, "a")
assert df._mgr.blocks[0].refs.has_reference() is False
arr = get_array(df, "a")

That's more explicit, but maybe redundant with the mutation check? (and also dependent on the order of the blocks in the result)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and also dependent on the order of the blocks in the result)

Could also be assert df._mgr._has_no_reference(1) to be robust for that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the last one, will add that

df["b"] = 100
arr = get_array(df, "a")
df.iloc[0, 0] = 100 # Check that we correctly removed the reference
assert np.shares_memory(arr, get_array(df, "a"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually not super clear to me what this test is targeting. Does the implementation of df.__setitem__ create an intermediary Block that might incorrectly have a reference? Or what's the trigger to explicitly test this?
(not that anything about the test is "wrong", just wondering ;))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, setitem splits the blocks under the hood to avoid calling np.delete(). This is basically what we are also testing in the del df["a"] test above.

I want to make sure that the spit blocks don't have any references on them if the parent block did not have any reference. Parent block went out of scope, so the new blocks should be clear of references as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, thanks for the explanation, wasn't thinking about the side effect of the non-inplace __setitem__ of splitting the block

df.iloc[0, 0] = 100 # Check that we correctly release the reference
if using_copy_on_write:
mark = pytest.mark.xfail(
reason="blk.delete does not track references correctly"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How blk.delete involved here?

Copy link
Member Author

@phofl phofl Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same explanation as above. blk.delete splits the block during setitem -> all blocks are referencing each other and view -> view goes out of scope -> theoretically the split blocks shouldn't have references anymore because they don't share data with another block.

This is what we will have to optimise later.

I'll adjust the comment to better explain it

@phofl
Copy link
Member Author

phofl commented Feb 6, 2023

Quick comments to your top-comment:

  • I ensured that every function that was called behind a ref-update on the Manager level updates the refs correctly now. I went through all possible code paths down to the block level. The rest should be hidden behind general copies for CoW which we optimise later.
  • Same for the 2nd point, I think we make general copies there without checking anything
  • will do
  • yes definitely. Most of the checks are more specific on the block level and easier as well I guess.

Yeah I mentioned the tracking In my top post as well. Would make things unnecessarily complex (and slower) imo if we only do this for CoW. Accessing it is what could cause a slowdown, which is not done without CoW enabled

bool
"""
self.referenced_blocks = [
obj for obj in self.referenced_blocks if obj() is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a naive implementation question: Could a weakref.WeakSet be use to track the references such that

  1. We don't unnecessarily add more than one weakref to the same block
  2. The set will prune itself of refs that don't exist anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work, yes. When I implemented it I wasn't sure if we potentially want to use the same block in 2 different objects, which would have made a list necessary. But we agreed above that this is not a good idea in general.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we agreed above that this is not a good idea in general.

While we agree we should ideally always use separate Block objects, changing this makes that a requirement, right? (because before you could have two identical Blocks in that list, although which means they would always cause the data to be referenced, even if the Manager that references one of either blocks would go out of scope. But at least it would ensure to not incorrectly not trigger CoW in such cases)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be a requirement.

But right now we are doing this in all cases where the CoW optimisations are enabled. If we decide this is no longer a good idea, we could change the implementation back to a list since this has no impact outside of the class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and to be clear I am fine with it being a requirement (I think that makes it clearer), but then we should maybe add this expectation to the docs you started. Maybe something like:

Whenever a DataFrame or Series object is sharing data with another object, it is required that each of those objects have its own BlockManager and Block objects. Thus, in other words, one Block instance (that is hold by a DataFrame, not necessarily for intermediate objects) should always be uniquely used for only a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False) (which will create a new Block instance with the same underlying values and which will correctly set up the references).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got you, that's a good idea. Added your suggestion to the docs

@phofl
Copy link
Member Author

phofl commented Feb 8, 2023

Merging now to unblock other stuff, can still optimise later if necessary

@phofl phofl merged commit 5d17d73 into pandas-dev:main Feb 8, 2023
@phofl phofl deleted the change_ref_tracking_for_cow branch February 8, 2023 09:01
------------------

To be able to determine, if we have to make a copy when writing into a DataFrame,
we have to be aware, if the values are shared with another DataFrame. pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma after "aware" is unnecessary. also after "determine" on L17


We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block, whose values share memory with each other. The reference
is held through a weak-reference. Every two blocks that share some memory should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two block -> pair of blocks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for the comments. opened #51552

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants